Automatic evaluation of translation quality with metrics such as BLEU, METEOR, and BERTScore usually requires a parallel corpus. Although this reference-based evaluation paradigm is widely used in many machine translation tasks, it is difficult to apply to translation involving low-resource languages, which suffer from a shortage of corpora. Round-trip translation offers an encouraging way to alleviate the urgent requirement for parallel corpora, although it was unfortunately observed not to correlate with forward translation in the era of statistical machine translation. In this paper, we first observe that forward translation quality consistently correlates with the corresponding round-trip translation quality in the neural machine translation regime. We then carefully analyze and reveal the reasons for the contradictory results on statistical machine translation systems. Second, we propose a simple yet effective regression method to predict forward translation scores from round-trip translation scores for various language pairs, including those between very low-resource languages. We conduct extensive experiments to show the effectiveness and robustness of the predictive model on more than 1,000 language pairs. Finally, we test our method in challenging settings, such as predicting scores: i) for language pairs unseen in training, and ii) for real-world WMT shared tasks but in new domains. The extensive experiments demonstrate the robustness and utility of our approach. We believe our work will inspire further research on very low-resource multilingual machine translation.
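The regression step above can be pictured with a minimal sketch: fit a one-variable least-squares model on (round-trip score, forward score) pairs and use it to predict the forward score of an unseen pair. The scores below are toy numbers, not from the paper; the paper's actual model and features may differ.

```python
# Minimal sketch: fit y = a*x + b by ordinary least squares on
# (round-trip BLEU, forward BLEU) pairs, then predict unseen pairs.
def fit_linear(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

def predict(model, x):
    a, b = model
    return a * x + b

# Hypothetical scores: round-trip BLEU vs. forward BLEU per language pair.
round_trip = [10.0, 20.0, 30.0, 40.0]
forward    = [ 6.0, 11.0, 16.0, 21.0]   # exactly linear here: y = 0.5x + 1
model = fit_linear(round_trip, forward)
print(predict(model, 25.0))  # → 13.5
```

In practice one would fit on language pairs with parallel data and apply the model to pairs that have none.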
Nowadays, due to breakthroughs in natural language generation (NLG), including machine translation, document summarization, image captioning, and so on, NLG models have been encapsulated in cloud APIs to serve over half a billion people worldwide and to process over one hundred billion word generations per day. NLG APIs have thus already become essential profitable services in many commercial companies. Given the substantial financial and intellectual investments involved, service providers adopt a pay-as-you-use policy to promote sustainable market growth. However, recent works have shown that cloud platforms suffer from financial losses imposed by model extraction attacks, which aim to imitate the functionality and utility of the victim services, thereby violating the intellectual property (IP) of cloud APIs. This work protects the IP of NLG APIs by identifying attackers who have utilized watermarked responses from the victim NLG APIs. However, most existing watermarking techniques are not directly applicable to the IP protection of NLG APIs. To bridge this gap, we first present a novel watermarking method for text generation APIs that conducts lexical modification on the original outputs. Compared with competitive baselines, our watermarking approach achieves better identifiability performance in terms of p-value, with fewer semantic losses. In addition, our watermark is more understandable and intuitive than the baselines. Finally, empirical studies show that our approach is also applicable to queries from different domains, and is effective against attackers trained on fewer than 10% watermarked samples.
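The lexical-modification idea can be sketched in miniature: always emit a fixed synonym for certain words, then test a suspect model's outputs with a one-sided binomial p-value. The word list and the 0.5 null probability below are illustrative assumptions, not the paper's actual watermark design.

```python
import math

# Hypothetical watermark dictionary: the API always emits the right-hand
# synonym, while generic text would use either variant at random.
WATERMARK = {"movie": "film", "big": "large", "buy": "purchase"}

def watermark_output(text):
    # Lexically modify the API output to embed the watermark.
    return " ".join(WATERMARK.get(w, w) for w in text.split())

def p_value(text):
    # One-sided binomial test: how surprising is this many watermark
    # synonyms if each variant were chosen with probability 0.5?
    pairs = set(WATERMARK) | set(WATERMARK.values())
    words = text.split()
    n = sum(w in pairs for w in words)                # watermark-eligible slots
    k = sum(w in WATERMARK.values() for w in words)   # watermarked choices
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n if n else 1.0

marked = watermark_output("a big movie you should buy")
print(marked)           # → "a large film you should purchase"
print(p_value(marked))  # → 0.125; small p-values indicate watermarked origin
```

Longer suspect texts accumulate more eligible slots, so genuine imitators of the watermarked API drive the p-value toward zero.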
Machine-learning-as-a-service (MLaaS) has attracted millions of users with its outstanding large models. Although published as black-box APIs, the valuable models behind these services are still vulnerable to imitation attacks. Recently, a series of works have demonstrated that attackers manage to steal or extract the victim models. Nonetheless, none of the previously stolen models could outperform the original black-box APIs. In this work, we conduct unsupervised domain adaptation and multi-victim ensemble to show that attackers could potentially surpass victims, which goes beyond the previous understanding of model extraction. Extensive experiments on benchmark datasets and real-world APIs validate that imitators can succeed in outperforming the original black-box models on transferred domains. We consider our work a milestone in the research of imitation attacks, especially on NLP APIs, as the superior performance could influence the defense and even the publishing strategy of API providers.
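The multi-victim ensemble idea can be illustrated with a toy sketch: combine the label predictions of several extracted imitators by majority vote, so that the ensemble can correct an individual imitator's mistakes. The imitators below are hypothetical lookup tables standing in for extracted models.

```python
from collections import Counter

# Hypothetical extracted imitators of several victim APIs; each maps an
# input to a label. A multi-victim ensemble votes over their outputs.
def make_imitator(table, default):
    return lambda x: table.get(x, default)

imitators = [
    make_imitator({"good film": "pos", "bad film": "neg"}, "pos"),
    make_imitator({"good film": "pos", "bad film": "pos"}, "neg"),  # weaker imitator
    make_imitator({"good film": "pos", "bad film": "neg"}, "neg"),
]

def ensemble(x):
    votes = Counter(f(x) for f in imitators)
    return votes.most_common(1)[0][0]

print(ensemble("bad film"))  # → "neg" (2 of 3 imitators outvote the weak one)
```

This is only the voting mechanic; the paper additionally applies unsupervised domain adaptation so the ensemble transfers to new domains.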
Semi-supervised learning (SSL) has been successful in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) can efficiently optimize compact neural networks, achieving its best results when expensive networks are distilled on fresh task-specific unlabeled data. However, task-specific unlabeled data can be challenging to obtain, especially for NLP. We investigate the use of generative models for synthesizing unlabeled data and present a simple and general framework called "generate, annotate, and learn (GAL)". A language model (LM) is used to synthesize in-domain unlabeled data. Classifiers are then used to annotate such data. Finally, the synthetically generated and annotated data are used to advance SSL, KD, and few-shot learning on NLP and tabular tasks. To obtain a strong task-specific LM, we either fine-tune a large LM on inputs from the specific task, or prompt a large LM with a few input examples and conditionally generate more examples. GAL also yields a new state of the art for 6-layer transformers on the GLUE leaderboard. Finally, self-training with GAL offers large gains on four tabular tasks from the UCI repository.
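The generate-annotate-learn loop can be sketched with a deliberately tiny stand-in: a 1-D nearest-centroid "classifier" plays the teacher, a uniform sampler plays the generative model, and the student retrains on real plus pseudo-labeled synthetic data. All components are toy assumptions; GAL itself uses LMs and neural classifiers.

```python
import random

random.seed(0)

# Toy stand-in for GAL: a "generator" produces unlabeled in-domain inputs,
# the current classifier annotates them, and a student retrains on the
# union of real and synthetic data.
def train_centroids(data):
    # data: list of (x, label); returns the per-class mean (a 1-D classifier)
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def annotate(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

labeled = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
teacher = train_centroids(labeled)

# Generate: sample unlabeled points from the input domain.
synthetic = [random.uniform(0, 10) for _ in range(200)]
# Annotate: pseudo-label the synthetic points with the teacher.
annotated = [(x, annotate(teacher, x)) for x in synthetic]
# Learn: retrain the student on real + synthetic data.
student = train_centroids(labeled + annotated)

print(annotate(student, 8.0))  # → "pos"
```

The same three-step shape underlies GAL's applications to SSL, KD (teacher annotates for a smaller student), and few-shot learning.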
The collection and availability of big data, combined with advances in pre-trained models (e.g., BERT, XLNet, etc.), have revolutionized the predictive performance of modern natural language processing tasks, ranging from text classification to text generation. This allows companies to provide machine learning as a service (MLaaS) by encapsulating fine-tuned BERT-based models as APIs. However, BERT-based APIs have exhibited a series of security and privacy vulnerabilities. For example, prior work has exploited the security issues of BERT-based APIs via adversarial examples crafted from an extracted model. However, the privacy-leakage problem of BERT-based APIs through the extracted model has not been well studied. On the other hand, due to the high capacity of BERT-based APIs, the fine-tuned model is prone to overfitting, yet what kind of information can be leaked from the extracted model remains unknown. In this work, we first present an effective model extraction attack, where we practically steal a BERT-based API (the target/victim model) by issuing only a limited number of queries. We further develop an effective attribute inference attack that can infer sensitive attributes of the training data used by the BERT-based APIs. Our extensive experiments on benchmark datasets under various realistic settings validate the potential vulnerabilities of BERT-based APIs. Moreover, we demonstrate that two promising defense methods are ineffective against our attacks, which calls for more effective defense methods.
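The query-based extraction step can be pictured with a minimal sketch: the attacker sees only (query, label) pairs from the black box and fits a local surrogate to them. The threshold "victim" below is a toy stand-in for a fine-tuned BERT API, not the paper's setup.

```python
# Minimal sketch of model extraction: query the black-box victim on a
# budget of inputs, record its labels, and fit a surrogate.
def victim_api(x):
    return "toxic" if x >= 0.5 else "benign"   # hidden decision rule

def extract(query_budget):
    # The attacker only observes (query, label) pairs, never the rule itself.
    queries = [i / (query_budget - 1) for i in range(query_budget)]
    pairs = [(q, victim_api(q)) for q in queries]
    # Surrogate: learn the smallest input labeled "toxic" as a threshold.
    threshold = min(q for q, y in pairs if y == "toxic")
    return lambda x: "toxic" if x >= threshold else "benign"

stolen = extract(query_budget=11)
print(stolen(0.9), stolen(0.1))  # → toxic benign
```

Attribute inference then probes a surrogate like `stolen` for regularities that reveal properties of the victim's training data.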
Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain the goals of optimality and computational efficiency, and it has recently been used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially-distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI-modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential of designs that use a GI approach to allocate participants: they can improve participant benefits, increase efficiency, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.
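The adaptive-allocation behavior described above can be illustrated in simulation. The Gittins index itself is costly to compute, so this sketch substitutes Thompson sampling with the conjugate Gamma prior on an exponential reward's rate, a common practical proxy for index policies; it is not the paper's GI modification.

```python
import random

random.seed(1)

# Adaptive 2-armed experiment with exponentially-distributed rewards.
def simulate(true_means, rounds):
    k = len(true_means)
    a = [1.0] * k   # Gamma posterior shape per arm (prior pseudo-counts)
    b = [1.0] * k   # Gamma posterior rate per arm (prior pseudo reward total)
    pulls = [0] * k
    for _ in range(rounds):
        # Sample a rate lambda per arm from its posterior Gamma(a, rate=b);
        # prefer the arm with the highest sampled mean 1/lambda, i.e. the
        # smallest sampled rate.
        sampled = [random.gammavariate(a[i], 1.0 / b[i]) for i in range(k)]
        arm = min(range(k), key=lambda i: sampled[i])
        reward = random.expovariate(1.0 / true_means[arm])
        a[arm] += 1.0
        b[arm] += reward
        pulls[arm] += 1
    return pulls

pulls = simulate(true_means=[1.0, 3.0], rounds=500)
print(pulls)  # the higher-mean arm should receive most of the pulls
```

As in the paper's comparison against non-adaptive designs, the adaptive policy earns more by shifting allocation toward the better arm while still sampling the other enough to learn.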
Transformer has achieved impressive successes for various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, which is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement generated by the ImageNet pretrained weights significantly degrades while transferring the weights to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach specifically for medical image classification with the Transformer backbone. Our BOLT consists of two networks, namely online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network representation of the same patch embedding tokens with a different perturbation. To maximally excavate the impact of Transformer from limited medical data, we propose an auxiliary difficulty ranking task. The Transformer is enforced to identify which branch (i.e., online/target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill the transformation-invariant features from the perturbed tokens to simultaneously achieve difficulty measurement and maintain the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading and diabetic retinopathy grading. The experimental results validate the superiority of our BOLT for medical image classification, compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
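In bootstrap-your-own-latent-style methods like BOLT's online/target pairing, the target branch is typically not trained by gradients but maintained as an exponential moving average (EMA) of the online branch. A minimal parameter-level sketch of that update, under that assumption (plain lists stand in for network weights):

```python
# EMA update: the target weights slowly track the online weights, giving
# the online branch a stable prediction target.
def ema_update(target, online, tau=0.99):
    return [tau * t + (1 - tau) * o for t, o in zip(target, online)]

online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(500):
    # In real training, `online` would also take a gradient step here.
    target = ema_update(target, online)

print([round(t, 3) for t in target])  # target has drifted close to online
```

The decay `tau` controls how stable the target is; values near 1 make the target change slowly, which is what keeps the self-supervised objective from collapsing.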
Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then perform a clustering algorithm to obtain clusters. To promote topic extraction by clustering, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately fails to make them mutually benefit each other to achieve the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop) which integrates text clustering and topic extraction into a unified framework and can achieve high-quality clustering results and extract topics from each cluster simultaneously. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On one hand, it provides text embeddings with a strong cluster structure which facilitates effective text clustering; on the other hand, it pays close attention to topic-related words for topic extraction because of its self-attention architecture. Moreover, the training of the enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
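The cluster-then-extract direction can be made concrete with a toy sketch: represent documents as bags of words, assign each to its most similar seed centroid, and read a "topic" off each cluster's top words. ClusTop itself uses enhanced language-model embeddings and proper dimensionality reduction, not raw counts.

```python
from collections import Counter

docs = [
    "stock market trade price",
    "market price stock fund",
    "football match goal team",
    "team goal football league",
]

def vectorize(doc):
    return Counter(doc.split())

def similarity(a, b):
    return sum(a[w] * b[w] for w in a)  # unnormalized dot product

def cluster(docs, seeds):
    # One assignment pass with the seed documents as fixed centroids.
    centroids = [vectorize(docs[s]) for s in seeds]
    groups = [[] for _ in seeds]
    for doc in docs:
        v = vectorize(doc)
        best = max(range(len(seeds)), key=lambda i: similarity(v, centroids[i]))
        groups[best].append(doc)
    return groups

def topic(group, n=2):
    # Cluster-specific topic = the cluster's most frequent words.
    counts = Counter(w for doc in group for w in doc.split())
    return [w for w, _ in counts.most_common(n)]

groups = cluster(docs, seeds=[0, 2])
topics = [topic(g) for g in groups]
print(topics)  # a finance-flavored topic and a sports-flavored topic
```

The paper's point is that running these two stages jointly, with a shared language model, beats this one-way pipeline.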
This paper illustrates the technology of user next-intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content interacted with by users, and the relations between them. We further introduce a Transformer-based model which integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.
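One simple way to picture integrating expert rules with a model, sketched under assumptions of our own (the intent names, rule table, and masking scheme below are hypothetical, not AlipayKG's actual design): the knowledge graph constrains which intents may follow the current one, and the model's scores are maximized over that allowed set.

```python
# Hypothetical rule table derived from a concept knowledge graph:
# current intent -> intents the graph allows next.
KG_RULES = {
    "book_flight": {"book_hotel", "buy_insurance"},
    "pay_bill": {"check_balance"},
}

def predict_next(current, model_scores):
    # Expert rules mask out intents that cannot follow the current one;
    # unknown intents fall back to the unconstrained model.
    allowed = KG_RULES.get(current, set(model_scores))
    candidates = {i: s for i, s in model_scores.items() if i in allowed}
    return max(candidates, key=candidates.get)

scores = {"book_hotel": 0.4, "watch_video": 0.5, "buy_insurance": 0.1}
print(predict_next("book_flight", scores))  # → "book_hotel"
```

Masking also keeps the decision explainable: every prediction can be traced to both a model score and a graph rule that permitted it.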
Capturing feature information effectively is of great importance in vision tasks. With the development of convolutional neural networks (CNNs), concepts like residual connections and multiple scales have promoted continual performance gains on diverse deep learning vision tasks. However, existing methods do not organically combine the advantages of these valid ideas. In this paper, we propose a novel CNN architecture called GoogLe2Net, which consists of residual feature-reutilization inceptions (ResFRI) or split residual feature-reutilization inceptions (Split-ResFRI). These create transverse passages between adjacent groups of convolutional layers to enable features to flow to later processing branches, and possess residual connections to better process information. Our GoogLe2Net is able to reutilize information captured by preceding groups of convolutional layers and express multi-scale features at a fine-grained level, which improves performance in image classification. The inception we propose can be embedded into inception-like networks directly without any migration costs. Moreover, in experiments on popular vision datasets, such as CIFAR10 (97.94%), CIFAR100 (85.91%) and Tiny ImageNet (70.54%), we obtain better results on the image classification task compared with other modern models.
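The transverse-passage idea can be sketched schematically, without real convolutions: the input is split into groups, each group is transformed, every group's output is also fed into the next group's input, and a residual connection wraps the whole block. The scalar "transform" below is a stand-in assumption for a group of convolutional layers.

```python
def transform(xs, w):
    return [w * x for x in xs]   # stand-in for a group of conv layers

def resfri_block(x, n_groups=3, w=2.0):
    size = len(x) // n_groups
    groups = [x[i * size:(i + 1) * size] for i in range(n_groups)]
    outputs, carry = [], []
    for g in groups:
        # Transverse passage: reuse the previous group's output as extra input.
        out = transform([a + b for a, b in zip(g, carry)] if carry else g, w)
        outputs.append(out)
        carry = out
    merged = [v for out in outputs for v in out]
    # Residual connection around the whole block.
    return [m + xi for m, xi in zip(merged, x)]

print(resfri_block([1.0] * 6))  # → [3.0, 3.0, 7.0, 7.0, 15.0, 15.0]
```

Later groups see progressively richer inputs (their own slice plus everything the earlier groups computed), which is the feature-reutilization effect the architecture exploits at multiple scales.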